Abstract


Medical image analysis plays a major role in providing quality health care. Better
imaging techniques have enabled effective diagnosis of various diseases. Automation
of the diagnostic process is essential because manual methods require a considerable
amount of time, effort and care, and are prone to errors. Automation also facilitates
the collection of test data for people in remote areas, where expert physicians are
not widely available.

This thesis presents a step towards automated testing of peripheral blood. Our work
aims at developing an automatic and reliable system for obtaining the differential
blood count (DLC), an important measure for blood-related diseases. We present an
effective technique for identification and segmentation of white blood cells in smear
images using autothresholding and the watershed algorithm. Feature extraction and
classification based on shape, color and texture features were performed, with a
peak accuracy of 80%. The performance of the system was satisfactory considering
the scarcity of data and the relatively poor quality of the images.
Chapter 1


Introduction 


1.1 Medical Image Analysis 

The aim of medical image analysis is to develop systems that are capable of
processing the medical images required for diagnosis. Its primary purpose is to
extract relevant information from images in order to facilitate unambiguous
detection of abnormalities. Such systems can also be used for visualization.
Computer-based descriptions are often more consistent than those derived by human
observers. The descriptions can include shape, color, pattern, texture, and other
image features. The increasing availability of computing power and appropriate
modeling techniques has enabled rapid development of medical systems for
quantitative image analysis that support disease detection, therapy planning and
medical education.

A wide range of tasks in medical image analysis include image enhancement,
segmentation, noise removal and pattern detection, depending upon the requirement.
For example, MRI image segmentation [8] helps in diagnosis, but sometimes
low-contrast MRI images need enhancement before being used for diagnosis [34], while
ultrasound fetal images need to be analyzed for texture in order to determine lung
maturity [5]. Patterns in bone images are recognized for forensic applications as
well as for the determination of age and the study of age-related bone developments
and bone diseases [27]. An angiogram, the X-ray image of a network of blood vessels,
needs to be processed to suppress the shadows created by bones in order to enable
correct diagnosis [38]. Images of skin moles need processing to extract descriptions
that can aid the diagnosis of melanoma, a skin cancer [17]. Similarly, segmentation
of sputum images helps in the diagnosis of lung cancer [35]. A range of diseases
caused by disorders in blood [29] can be diagnosed with the help of blood smears. In
this thesis, we have worked with color images of blood smears to detect and classify
white blood cells.


1.2 Composition of Blood 

Blood is a fluid tissue flowing through the circulatory system, transporting
digested food substances, excretory products, and dissolved gases. It is composed of
various types of blood cells suspended in a fluid called plasma. The types of blood
cells include:

• Red Blood Cells (RBC) or Erythrocytes, which carry oxygen from the lungs 
to the rest of the body. 

• White Blood Cells (WBC) or Leukocytes, which help fight infections and aid 
in the immune process. 

• Platelets or Thrombocytes, which help in blood clotting. 

Blood cells [15] are formed in the bone marrow. In the initial phase, they are
called “stem cells” or “hematopoietic cells”. As the stem cells mature, distinct
cells of each type evolve.

WBC’s are responsible for the defense system of the body and fight infections. They
are much bigger in size and fewer in number than RBC’s. There are approximately
6,000 WBC’s per cubic millimeter of blood, or half a million WBC’s in every drop of
human blood. WBC’s have a life-span ranging from a few hours to a few days. When
they die, the dead cells are engulfed by the surrounding WBC’s and replaced with new
ones. WBC’s include immature and mature types. Studies show that WBC’s display
features that continuously evolve from the primitive forms to the mature cell types,
making the initial and final features extremely different. This feature variability
makes WBC classification a difficult task. Immature types include unsegmented
neutrophils, blasts, variant lymphocytes, proerythroblasts, myeloblasts,
erythroblasts and monoblasts. Mature cells can be divided into 5 classes, namely:
neutrophils (50-70%), lymphocytes (25-35%), monocytes (4-10%), eosinophils (less
than 5%) and basophils (fewer than 1%).

Complete Blood Count: Complete blood cell count is the measurement of the size,
number, and maturity of the different blood cells in a specific volume of blood,
usually a microliter. It is used to determine abnormalities in either the production
or the destruction of blood cells. Variations from the normal number, size, or
maturity of the blood cells are an indication of infection or disease, such as
leukemia, anemia and sickle cell disease. One of the steps in the ‘Complete Blood
Count’ is to perform the ‘Differential Blood Count’.

Differential Blood Count: Differential blood count is specific to WBC’s. It is
carried out to calculate the relative percentage of each type of WBC, since it helps
in diagnosing the cause of many ailments. In a normal person, there are about 3150
to 6200 neutrophils, 1500 to 3000 lymphocytes, 300 to 500 monocytes, 50 to 250
eosinophils and 15 to 50 basophils per microliter of blood. Changes in these counts
are indicators of disease. For instance, a high neutrophil count would suggest
infection, cancer or physical stress. High lymphocyte counts are usually due to
Acquired Immune Deficiency Syndrome (AIDS). High monocyte and eosinophil counts
usually point at bacterial infection. Thus the differential blood count is indicative
of one’s health status. Both conventional and automatic methods are prevalent for
computing it accurately.
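
As a small illustration of the arithmetic behind a differential count, the sketch
below turns a list of per-cell class labels into relative percentages. The function
name, the label strings and the sample counts are hypothetical and chosen only for
this example; they do not come from the system described in this thesis.

```python
from collections import Counter

# Hypothetical label strings for the five mature WBC classes.
WBC_TYPES = ("neutrophil", "lymphocyte", "monocyte", "eosinophil", "basophil")


def differential_count(labels):
    """Return the relative percentage of each WBC type in `labels`.

    Labels outside WBC_TYPES still count towards the total but are not
    reported, so the percentages may then sum to less than 100.
    """
    if not labels:
        raise ValueError("at least one classified cell is required")
    counts = Counter(labels)
    total = len(labels)
    return {t: 100.0 * counts.get(t, 0) / total for t in WBC_TYPES}


# Illustrative sample of 100 classified cells (made-up numbers).
sample = (["neutrophil"] * 60 + ["lymphocyte"] * 30 + ["monocyte"] * 6
          + ["eosinophil"] * 3 + ["basophil"] * 1)
dlc = differential_count(sample)
```

The result is a dictionary keyed by cell type, which can be compared directly with
the normal percentage ranges quoted for the mature classes.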

1.3 Motivation 

The manual method for differential blood count uses a stained slide. The technician
typically studies 100 WBC’s to determine the type of each of the cells, in order to
calculate their relative percentages. This method suffers from several pitfalls. It
is not only time-consuming, but the quality of the results also depends heavily on
the technician’s skill and experience. Apart from the amount of work involved,
manual analysis is questionable because of its limited precision and the poor
reproducibility of its results.

Some automatic methods utilize the fluid properties of blood for counting purposes.
WBC counters based on flow cytometry [1] are in use. They utilize the Coulter
Principle [12] of impedance measurement for a liquid-dispersed blood flow. The
distribution of RBC’s and WBC’s and the classification of a couple of classes of
WBC’s are performed using laser light scattering from stationary suspensions [36].
These methods rely on haematological practice, but at the same time they ignore the
rich visual information available in images. Hence image processing offers a better
alternative for the task.

Robust systems based on automated image processing can overcome manual errors.
Besides, automated systems can help overcome the scarcity of trained personnel. Given
a full-fledged system for differential blood count, the only manual effort required
would be to prepare a stained slide, acquire images with a microscope and hand them
over to the system, which can easily be done by a suitably trained person.

1.4 Organization of Thesis 

Chapter 2 describes the typical steps in automated WBC counting and summa- 
rizes the past work for automatic and semi-automatic systems for differential blood 
counting based on image processing. 

Chapter 3 presents the segmentation scheme employed for the system. The technique,
based on k-means clustering followed by autothresholding and the watershed
algorithm, is proposed and its results are presented.

Chapter 4 discusses various shape based, color based and texture based features
that can be extracted from the cell images. The features finally chosen to form the
feature vector, as well as the features rejected, are also discussed.

Chapter 5 presents a comparative study of classifiers for the classification task.
The relative performance of the classifiers is tabulated.

Chapter 6 contains the concluding remarks and possible directions for the future 
work. 



Chapter 2 

Overview and Previous Work 


2.1 Overview


A system for automatic differential blood count aims to distinguish between the five
classes of mature WBC’s, namely lymphocyte, monocyte, eosinophil, basophil and
neutrophil. Typically, the input to the system is a digital image of a blood smear
and the output is the differential count of the types of cells.



Figure 2.1: Block Diagram for typical Differential Blood Count System 


The block diagram of the overall system is depicted in Figure 2.1. The main
stages of the system are:

Acquisition: This is the process of capturing images of blood smears. In our
case, color images are captured using a digital camera mounted on a microscope.

Segmentation: This is a very crucial step, because it estimates the shape of the
WBC, and thus the efficiency of the subsequent stages depends on it. In this stage,
WBC’s are extracted from the background and other constituents of blood such as
RBC’s, plasma, platelets and cell fragments. Further, the distinction between the
nucleus and the cytoplasm of each cell is accomplished.

Feature Extraction: Features extracted from segmented cells are generally
shape-based, color-based and texture-based. Shape-based features include the
eccentricity of the nucleus, the eccentricity of the cytoplasm, the area-ratio
between the cytoplasm and the nucleus, the number of nucleus lobes etc. The
color-based features used are the average red, green and blue components of the
nucleus and the cytoplasm. Texture-based features include energy, entropy,
correlation etc. It has been observed that shape-based features are the most
important features.

Classification: Based on the features extracted, a set of feature vectors
representing all 5 types of WBC’s is created. This set is used to train the
classification model. After training, an unseen sample of a WBC can be classified as
one of the 5 types. Widely used classifiers for this purpose are neural networks,
support vector machines and the Bayesian classifier.

2.2 Previous work 

Previous work in Automated Differential blood count has been done at various levels. 
Some techniques have only worked on detecting WBC’s in the image [39, 41], while 
some others have been successful in segmentation and classification of blood cells 
also [24, 25, 30]. Most of the work has been carried out on healthy and mature cells, 
but a couple of works [3, 30] also include immature cells for classification. Some of 
the techniques have used gray images as input [6, 39], while recent work has been 
done on color images [24, 25]. The important advantage of gray scale images is that 
they are less sensitive to variations of lighting conditions and require less processing 
time and storage as compared to color images, but at the same time important 
color information which could be useful for segmentation and classification remains 
unexploited. A review of some of the earlier work is presented. 



2.2.1 Identification of WBC’s


Sheikh et al. [39] presented a method to differentiate between RBC’s, WBC’s and
platelets from gray images on the basis of size, shape, volume of the cells and
presence of a nucleus. Cells were manually segmented from the image, followed by
wavelet based feature extraction. Finally, an ALOPEX neural network [43] was used
for classification. Training and testing were done on 11 and 9 cell images
respectively, which is a very small database, but they claim an accuracy of 89%.
Park et al. [...] suggest a method based on gray scale bone marrow images for
distinguishing different types of cells. The technique uses the watershed algorithm
to perform oversegmentation to create initial “patches”, and patch labels are
adjusted using context information till convergence. However, they do not specify
any objective evaluation of the effectiveness of the technique. Sobrevilla et al.
[41] have reported automatic WBC detection in gray scale bone marrow images.
Geometrical, textural and morphological information is used to make fuzzy rules in
order to detect the WBC’s. An accuracy of 93% has been reported for detection of
WBC’s. Wei et al. [...] used neural networks to distinguish between RBC’s, WBC’s
and platelets. They proposed a “boundary following” technique to retain details of
the contour of cells. They used scanned images from the Atlas of Blood Cells [47] as
their input. They claim [...]% accuracy when boundary detection is accurate; however,
the technique is not fully automatic. It expects threshold values for
differentiation of cells and a starting point for boundary detection from the user.

Hengen et al. [21] have reported a technique to follow the identification stage.
They presented a declustering method to handle overlapping cells. Thresholding the
distance transform, followed by a region-growing algorithm, is used for such cells.
Bamford et al. [2] have achieved accurate nucleus segmentation for pap-stained
cervical images, which can be extended to blood cell segmentation.

2.2.2 Segmentation of WBC’s

Comaniciu et al. [10] use non-Gaussian clusters in the L*u*v* color space. Their
cell segmentation algorithm detects clusters in the L*u*v* color space and refines
their borders using a gradient ascent mean shift procedure [11]. Katz [24] extracts
the region of interest based on thresholding. The segmentation of the sub-image into
cell and non-cell regions is carried out using Canny edge detection followed by
circle identification. However, the threshold was selected empirically and circle
identification required manual intervention. Wermser et al. [45] used hierarchical
thresholding based on chromatic properties of the background and cell components.
Kovalev et al. [25] use a three-step algorithm: extraction of the nucleus,
circle-shaped approximation, and improvement of the cytoplasm region using a priori
information. Cseke [13] implemented a fast segmentation technique which utilised
Otsu’s automatic thresholding method [31]. However, the technique does not
differentiate between red blood cells and cytoplasm.

2.2.3 Classification of WBC’s 

Bikhet et al. [6] have worked on segmentation and classification of 5 types of
WBC’s. Segmentation was achieved using hierarchical thresholding based on histogram
entropy classification and iterative threshold selection [22, 33]. It is claimed
that the 10-dimensional feature vector, shape-based and color-based, achieved an
accuracy of 90% on 71 cells. Ongun et al. [30] worked on color images containing
both mature and immature cells. Segmentation was accomplished by morphological
preprocessing combined with fuzzy patch labeling. The 57-dimensional feature vector
consisted of shape-based (area of nucleus and cytoplasm, ratio of nucleus to cell
area etc.), color-based (color histogram, mean and standard deviation of components
in the CIE-Lab domain) and texture-based (contrast, homogeneity and entropy derived
from the gray-level co-occurrence matrix) features. While various classifiers were
used, the peak performance of 91% was achieved using SVM. Sinha [40] has worked on
very good quality colour images of 5 types of mature cells. Segmentation was done in
two parts: coarse segmentation was achieved using k-means, followed by fine
segmentation using the Expectation-Maximization algorithm, and 80% segmentation
accuracy was achieved. The features extracted were the eccentricity of the nucleus
and the cytoplasm, the compactness of the nucleus, the number of nucleus lobes, the
average red, green and blue components of the nucleus and the cytoplasm, and the
texture-based features energy, entropy and correlation. For classification, various
classifiers such as nearest neighbour, k-nearest neighbour, weighted k-nearest
neighbour, Bayesian classifier, SVM and neural networks were used, with SVM giving a
peak performance of 94%.
Chapter 3

Segmentation 


3.1 Introduction 

A typical blood smear consists of WBC’s, RBC’s, plasma, platelets and cell
fragments. The goal of segmentation is to locate the WBC’s in the smear image and
mark the boundaries of the nucleus and cytoplasm regions. This is necessary before
further processing to classify them as one of the 5 classes of WBC’s. This stage is
very crucial because the accuracy of classification will largely depend on the
outcome of segmentation.

Most techniques proposed so far are sensitive to the right selection of some
parameters, such as image acquisition conditions [40], threshold selection [24] or
initial contour [30]. Also, some of the techniques [24] assume a circular shape for
white blood cells, which is not true in most cases. Cseke et al. [13] proposed a
robust technique for segmentation, but further segmentation between cytoplasm and
red blood cells was not attempted. We report a two-stage segmentation scheme that
enables us to distinguish the cytoplasm and nucleus of a WBC from the input image
of a blood smear and requires no manual interaction for parameter tuning.
3.2 Segmentation

3.2.1 Overview 

Figure 3.1 shows an overview of the proposed segmentation scheme. We first locate
the nuclei of the cells using k-means clustering on the Hue-Saturation-Value (HSV)
equivalent of the image. We then crop a rectangular region around each nucleus such
that it encompasses the entire cell. The block diagram of the process for obtaining
smaller images with only one cell is shown in figure 3.1(a). Further processing is
carried out on a gray version of these sub-images. Autothresholding [13], followed by
declustering using the watershed algorithm, is used to obtain segmented cytoplasm
and nucleus regions. Results are further refined by choosing the clusters that belong
to the cytoplasm. The schematic for the second level of segmentation is shown in
figure 3.1(b).

3.2.2 HSV-Space 

HSV-space [16] is considered to be important for segmentation algorithms.
Conversion to HSV is one approach to decouple the intensity component (Value) from
the color information (Hue). This closely corresponds to the way color is perceived,
rather than as a superposition of the primary colors as in the RGB model. The HSV
model is also useful for quantifying the purity of the color (Saturation).

A major difficulty with using color cues in machine vision is the color constancy
problem, which arises due to the variation in color values brought about by lighting
changes. This is particularly apparent in RGB space, where intensity is distributed
throughout all three parameters, rendering color values highly sensitive to scene
brightness. A simple approach to color constancy is to use the HSV color space,
which consists of hue angle (H), color saturation (S) and brightness (V). In order to
obtain a limited level of intensity invariance, color can be modeled in HS-space.
Hence this color space is generally preferred for segmentation algorithms.


Figure 3.1: Overview of the Segmentation. (a) Generation of sub-images containing
single cells. (b) Segmentation of nucleus and cytoplasm.

Our approach also performs segmentation in HSV space. The RGB images are
converted to their HSV equivalent using the following equations:

$$H = \cos^{-1}\left\{ \frac{\frac{1}{2}\left[(R-G) + (R-B)\right]}{\left[(R-G)^2 + (R-B)(G-B)\right]^{1/2}} \right\} \qquad (1)$$

$$S = 1 - \frac{3\,\min(R, G, B)}{R + G + B} \qquad (2)$$

$$V = \frac{1}{3}(R + G + B) \qquad (3)$$


Figure 3.2 shows a histogram of S-component of a typical cell, which depicts 
distinct peaks corresponding to each of the regions in the blood-smear, in which 
high values of saturation correspond to the WBC-nucleus. This feature helps us in 
identifying the cluster belonging to the nucleus. 
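
The RGB to HSV conversion used here can be sketched for a single pixel as follows.
This is a minimal illustration, assuming RGB components already scaled to [0, 1] and
the standard arccosine-based hue formula; the function name `rgb_to_hsv_pixel`, the
handling of black and gray pixels, and the reflection of the hue when B > G are
assumptions added for the example and are not stated in the text.

```python
import math


def rgb_to_hsv_pixel(r, g, b):
    """Convert one RGB pixel (components in [0, 1]) to (H, S, V).

    H is returned in radians in [0, 2*pi); S and V lie in [0, 1].
    """
    total = r + g + b
    v = total / 3.0
    if total == 0:
        # Black pixel: hue and saturation are undefined, set to 0 (assumption).
        return 0.0, 0.0, 0.0
    s = 1.0 - 3.0 * min(r, g, b) / total
    num = 0.5 * ((r - g) + (r - b))
    # This radicand equals half the sum of squared pairwise differences,
    # so it is never negative and is zero only for gray pixels.
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        h = 0.0  # gray pixel: hue undefined, set to 0 (assumption)
    else:
        # Clamp to guard against rounding slightly outside [-1, 1].
        h = math.acos(max(-1.0, min(1.0, num / den)))
        if b > g:
            # Assumption: usual convention to cover the full hue circle.
            h = 2.0 * math.pi - h
    return h, s, v


h_red, s_red, v_red = rgb_to_hsv_pixel(1.0, 0.0, 0.0)
```

Applying the function to every pixel yields the S-channel whose histogram is
discussed above; a fully saturated primary such as pure red gives S = 1.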



Figure 3.2: Histogram of S-component of a typical cell image 
Each pixel is described by three components: H, S and V. Thus, an image is a set of
3-component vectors in HSV space. For clustering purposes, all input values are
normalized between 0 and 1. The image is then smoothed using a low-pass filter that
averages over a window of size 5x5. k-means clustering is performed on this set of
vectors. As mentioned in [40], we have used 5 clusters in our experiment because,
apart from the clusters for nucleus, cytoplasm, RBC’s and background, the rim of
the RBC’s also forms a separate cluster. The centroids are initialized by choosing k
uniformly distributed points in the vector space. Euclidean distance is used as a
measure of dissimilarity. When the difference between successive values of each
centroid is less than a pre-defined threshold, clustering is said to have converged.
At the end of clustering, each pixel is a member of one of the k clusters and the
centroid of each cluster is obtained.
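
The clustering loop described above can be sketched in plain Python as follows. This
is an illustrative sketch, not the implementation used in this work: the function
names, the tolerance value and the choice of placing the k initial centroids at
evenly spaced points on the main diagonal of the unit cube are assumptions, and the
5x5 smoothing step is left out.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=5, tol=1e-4, max_iter=100):
    """Cluster normalized feature vectors; return (labels, centroids)."""
    dim = len(points[0])
    # Assumption: "uniformly distributed" initial centroids are taken as
    # k evenly spaced points on the main diagonal of the unit cube.
    centroids = [[(j + 0.5) / k] * dim for j in range(k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        for n, p in enumerate(points):
            labels[n] = min(range(k),
                            key=lambda j: squared_distance(p, centroids[j]))
        # Update step: move each centroid to the mean of its members.
        shift = 0.0
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                continue  # an empty cluster keeps its previous centroid
            new = [sum(col) / len(members) for col in zip(*members)]
            shift = max(shift,
                        max(abs(a - b) for a, b in zip(new, centroids[j])))
            centroids[j] = new
        # Converged when no centroid component moved by more than tol.
        if shift < tol:
            break
    return labels, centroids


# Tiny made-up example with two well separated groups of HSV vectors.
pixels = [(0.10, 0.10, 0.10), (0.12, 0.08, 0.10),
          (0.90, 0.90, 0.90), (0.88, 0.92, 0.90)]
pixel_labels, pixel_centroids = kmeans(pixels, k=2)
```

For a real smear image, `points` would hold one (H, S, V) vector per pixel and k
would be 5, as stated above.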


Among these clusters, the centroid with maximum saturation corresponds to the
nucleus cluster. We then crop a rectangular region around the nucleus, of sufficient
area that it contains the entire cell. Thus a set of sub-images, each containing a
single WBC, is obtained. Further segmentation is achieved using autothresholding
and declustering using the watershed algorithm.
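
Selecting the nucleus cluster and cropping around it can be sketched as below. The
sketch assumes that centroids are (H, S, V) triples, that the label image contains a
single nucleus, and that a fixed pixel `margin` is enough to cover the whole cell;
the margin and both function names are hypothetical, since the text only requires a
region "of sufficient area".

```python
def nucleus_cluster(centroids):
    """Return the index of the centroid with the highest saturation.

    Each centroid is an (H, S, V) triple, so saturation is component 1.
    """
    return max(range(len(centroids)), key=lambda j: centroids[j][1])


def crop_box(label_img, nucleus_label, margin):
    """Bounding box of the nucleus pixels, grown by `margin` on every side.

    Returns (top, left, bottom, right) with bottom and right exclusive,
    clipped to the image, or None when the label does not occur.
    """
    coords = [(r, c)
              for r, row in enumerate(label_img)
              for c, v in enumerate(row) if v == nucleus_label]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    height, width = len(label_img), len(label_img[0])
    return (max(min(rows) - margin, 0),
            max(min(cols) - margin, 0),
            min(max(rows) + 1 + margin, height),
            min(max(cols) + 1 + margin, width))


# Made-up centroids and a 6x6 label image with a 2x2 nucleus.
centroids = [(0.10, 0.20, 0.90), (0.70, 0.80, 0.30), (0.50, 0.40, 0.60)]
nucleus = nucleus_cluster(centroids)
label_img = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        label_img[r][c] = nucleus
box = crop_box(label_img, nucleus, margin=1)
```

An image with several WBC’s would first need the nucleus pixels split into connected
components, with one box computed per component.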


3.2.4 Autothresholding 

Each sub-image is processed separately in this stage. It has been observed that,
given the gray scale image of any cell, dark regions correspond to the nucleus,
bright regions correspond to the background and intermediate regions correspond to
cytoplasm and RBC’s. So, thresholding the image into three classes separates the
cell structures from one another, except for cytoplasm and RBC’s. We use the
automatic threshold selection proposed by Otsu [31]. In this method, the optimal
thresholds T1 and T2 are selected by maximizing the interclass variance between the
dark, gray and bright regions. Cseke et al. [14] proved that maximizing the
interclass variance can be reduced to maximizing the function in equation 4.


$$E(T_1, T_2) = \frac{m^2(0, T_1)}{n(0, T_1)} + \frac{m^2(T_1, T_2)}{n(T_1, T_2)} + \frac{m^2(T_2, L)}{n(T_2, L)} \qquad (4)$$

Figure 3.3: Original Image and Segmented Image



where L denotes the number of gray levels (255 in our case) and m() and n() denote
the following expressions:

$$m(x, y) = \sum_{i=x}^{y-1} i\,H[i], \qquad n(x, y) = \sum_{i=x}^{y-1} H[i] \qquad (5)$$

where H[] denotes the histogram of the sub-image to be thresholded.
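
A direct way to use the criterion E(T1, T2) is an exhaustive search over all
threshold pairs, sketched below. Prefix sums make each evaluation of m() and n() a
constant-time lookup. The function name is hypothetical, L is taken as the length of
the histogram, and empty classes are assumed to contribute zero to E.

```python
def otsu_two_thresholds(hist):
    """Return (T1, T2) maximizing E(T1, T2) for a gray-level histogram.

    The three classes are the levels [0, T1), [T1, T2) and [T2, L).
    """
    L = len(hist)
    # Prefix sums: N[y] = sum of H[i] for i < y, M[y] = sum of i*H[i] for i < y.
    N = [0] * (L + 1)
    M = [0] * (L + 1)
    for i, h in enumerate(hist):
        N[i + 1] = N[i] + h
        M[i + 1] = M[i] + i * h

    def term(x, y):
        """m(x, y)^2 / n(x, y), taken as 0 for an empty class (assumption)."""
        n = N[y] - N[x]
        m = M[y] - M[x]
        return m * m / n if n > 0 else 0.0

    best_e, best_t1, best_t2 = -1.0, None, None
    for t1 in range(1, L - 1):
        for t2 in range(t1 + 1, L):
            e = term(0, t1) + term(t1, t2) + term(t2, L)
            if e > best_e:
                best_e, best_t1, best_t2 = e, t1, t2
    return best_t1, best_t2


# Made-up histogram with three spikes: dark, intermediate and bright pixels.
hist = [0] * 256
hist[10], hist[100], hist[200] = 100, 100, 100
t1, t2 = otsu_two_thresholds(hist)
```

For 256 gray levels the search visits roughly 32,000 pairs, which is cheap for a
single sub-image; an iterative solution is mainly of interest when many sub-images
must be processed.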

Further, Reddi et al. [37] proved that the function E is maximized when equations
6 and 7 are satisfied, which can be easily solved using iterative algorithms.

$$\frac{m(0, T_1)}{n(0, T_1)} + \frac{m(T_1, T_2)}{n(T_1, T_2)} = 2\,T_1 \qquad (6)$$

$$\frac{m(T_1, T_2)}{n(T_1, T_2)} + \frac{m(T_2, L)}{n(T_2, L)} = 2\,T_2 \qquad (7)$$


Using the thresholds T1 and T2, we achieve coarse segmentation (see figure 3.3).
As mentioned, this autothresholding technique cannot separate cytoplasm from RBC’s:
pixels in the intermediate class are labeled as belonging to either cytoplasm or
RBC’s. Declustering is then done to separate cytoplasm from RBC’s.
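
The conditions of Reddi et al. state that each threshold equals the average of the
mean gray levels of its two neighbouring classes, which suggests a simple fixed-point
iteration. The sketch below is one possible form of such an iteration; the starting
thresholds at one third and two thirds of the gray range, the midpoint fallback for
an empty class and the function name are assumptions made for the example.

```python
def reddi_thresholds(hist, max_iter=100):
    """Iterate T1, T2 until each is the average of its adjacent class means."""
    L = len(hist)
    # Prefix sums so that m(x, y) and n(x, y) are constant-time lookups.
    N = [0] * (L + 1)
    M = [0] * (L + 1)
    for i, h in enumerate(hist):
        N[i + 1] = N[i] + h
        M[i + 1] = M[i] + i * h

    def mean(x, y):
        """Mean gray level m(x, y) / n(x, y) of the class [x, y)."""
        n = N[y] - N[x]
        # Assumption: an empty class falls back to the midpoint of its range.
        return (M[y] - M[x]) / n if n > 0 else (x + y) / 2.0

    t1, t2 = L // 3, 2 * L // 3  # assumed starting point
    for _ in range(max_iter):
        new_t1 = int(round((mean(0, t1) + mean(t1, t2)) / 2.0))
        new_t2 = int(round((mean(t1, t2) + mean(t2, L)) / 2.0))
        if (new_t1, new_t2) == (t1, t2):
            break  # fixed point reached
        t1, t2 = new_t1, new_t2
    return t1, t2


# Made-up histogram with three spikes at gray levels 10, 100 and 200.
spikes = [0] * 256
spikes[10], spikes[100], spikes[200] = 100, 100, 100
rt1, rt2 = reddi_thresholds(spikes)
```

Like any fixed-point scheme, the iteration may settle on a local solution, so the
result depends on the starting thresholds.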
Figure 3.4: Negative Distance Transform


3.2.5 Declustering using watershed algorithm 

The declustering technique involves the concept of the negative distance transform
and the watershed algorithm.

Negative Distance Transform

The concept of the distance transform [22] has been defined for binary images. It is
computed for every foreground pixel as the distance between the foreground pixel
considered and the nearest background pixel. The result of the transform is a gray
level image that looks similar to the input image, except that the gray intensities
of the points in the foreground region are changed to show the distance to the
closest background pixel (see figure 3.4). The distance metric chosen is Euclidean;
however, other metrics can also be adopted. The distance transform is defined as:

$$D(f) = \min_{b \in B} d(f, b) \qquad (8)$$

where f is the foreground pixel, B is the set of all background pixels in the
image and d(f, b) is the distance between f and b. The negative of this transform is
defined as the negative distance transform:

$$N(f) = 255 - D(f) \qquad (9)$$
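
A brute-force version of the negative distance transform is sketched below for
illustration. It follows the definitions directly and is only practical for small
sub-images; the function name, the clipping of distances above 255 and the value
255 assigned to background pixels (where the distance is zero) are assumptions made
for the example.

```python
import math


def negative_distance_transform(mask):
    """Return N(f) = 255 - D(f) for a binary mask given as a 2-D list.

    D(f) is the Euclidean distance from a foreground pixel to the nearest
    background pixel. Background pixels have D = 0 and therefore N = 255.
    """
    rows, cols = len(mask), len(mask[0])
    background = [(r, c) for r in range(rows) for c in range(cols)
                  if not mask[r][c]]
    if not background:
        raise ValueError("mask has no background pixel; D(f) is undefined")
    out = [[255.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                d = min(math.hypot(r - br, c - bc) for br, bc in background)
                # Assumption: clip large distances so N stays in [0, 255].
                out[r][c] = 255.0 - min(d, 255.0)
    return out


# 5x5 mask with a 3x3 foreground block in the middle.
mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
        for r in range(5)]
ndt = negative_distance_transform(mask)
```

The centre of each blob becomes a local minimum of N, which is what allows a
subsequent watershed step to treat every blob as its own catchment basin.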


Watershed algorithm 

The watershed algorithm [4, 23] is a fundamental image segmentation tool in
mathematical morphology. The watershed transform is based on an analogy with
topographic reliefs. An image can be thought of as a three dimensional relief with
the grayscale value at each point corresponding to height. Imagine that the relief
has a hole pierced at each of its minima and is slowly immersed in water. Water
rises through the holes and floods the surrounding basins, and a dam is built
wherever the waters from two different basins would otherwise meet at a point. Once
the relief has become completely covered by water, we end up with a structure with
several barriers or dams on it. These dams represent the watershed lines and serve
to separate the “catchment basins” of the relief (figure 3.5). One of the main
advantages of the watershed transform as a segmentation tool is that the segment
boundaries it produces are closed.

Figure 3.5: Visualization of Watershed Algorithm

The watershed algorithm is described by Vincent et al. [44]. The set of the
catchment basins of the grayscale image $I$ is equal to the set $X_{h_{max}}$
obtained after the following recursion:

$$X_{h_{min}} = \{p \in D_I : I(p) \le h_{min}\} \qquad (10)$$

where $h_{min}$ and $h_{max}$ denote the minimum and maximum gray level respectively
in $I$, and $D_I$ is the domain of the image $I$. $X_{h_{min}}$ is the set of points
which are first reached by water. These points constitute the starting set of the
recursion.

$$X_{h+1} = \mathrm{min}_{h+1} \cup IZ_{T_{h+1}(I)}(X_h), \quad \forall h \in [h_{min}, h_{max} - 1] \qquad (11)$$

Here $T_h(I) = \{p \in D_I : I(p) \le h\}$ is the threshold set of $I$ at level $h$
and $\mathrm{min}_h$ is the union of the regional minima of $I$ at altitude $h$,
following the notation of [44]. Further,

$$IZ_A(B) = \bigcup_{i=1}^{k} iz_A(B_i) \qquad (12)$$

$$iz_A(B_i) = \{p \in A : \forall j \in [1, k] \setminus \{i\},\ d_A(p, B_i) < d_A(p, B_j)\} \qquad (13)$$

Here, $iz_A(B_i)$ is the geodesic influence zone of the connected component $B_i$ of
$B$ in $A$, defined as the locus of the points of $A$ whose geodesic distance to
$B_i$ is smaller than their geodesic distance to any other component of $B$. The
geodesic distance $d_A(x, y)$ is defined as the length of the shortest path (if any)
between $x$ and $y$ that is totally included in region $A$.

Figure 3.6: Oversegmentation: Cytoplasm mask and watershed segmented output

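
To make the flooding idea concrete, the sketch below implements a simplified,
marker-based priority-flood watershed. It is not the recursion of Vincent et al.
given above: it assumes that one seed label per basin is supplied in `markers`, uses
4-connectivity, and assigns every pixel to a basin without drawing explicit dam
pixels. The function name and the example data are hypothetical.

```python
import heapq


def marker_watershed(height, markers):
    """Flood a height map from labelled seeds; return a label image.

    `height` and `markers` are 2-D lists of equal size. Seeds carry
    positive labels in `markers`; 0 marks pixels still to be flooded.
    """
    rows, cols = len(height), len(height[0])
    labels = [row[:] for row in markers]
    heap = []
    order = 0  # insertion counter, breaks ties between equal water levels
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] > 0:
                heapq.heappush(heap, (height[r][c], order, r, c))
                order += 1
    while heap:
        level, _, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] == 0:
                # The neighbour is reached first from this basin.
                labels[nr][nc] = labels[r][c]
                # Water cannot fall below the level it has already reached.
                heapq.heappush(
                    heap, (max(level, height[nr][nc]), order, nr, nc))
                order += 1
    return labels


# One-row relief with two valleys separated by a ridge at column 3.
relief = [[0, 1, 2, 3, 2, 1, 0]]
seeds = [[1, 0, 0, 0, 0, 0, 2]]
basins = marker_watershed(relief, seeds)
```

In a declustering setting the height map would be the negative distance transform
and the seeds would be placed at its minima; the ridge pixel between two basins goes
to whichever basin reaches it first.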


We apply the watershed algorithm to the negative distance transform of the binary
image of the cytoplasm. As a result of declustering we obtain a cluster which belongs
to the cytoplasm. The advantage of using the watershed algorithm is that the contour
information is not lost. In some cases, due to oversegmentation, the cytoplasm is
divided into several clusters, so in order to get a binary mask for the cytoplasm we
need to merge these clusters.

3.2.6 Merging 

The problem with the watershed algorithm is that it is very sensitive to sharp
boundaries, which results in the cytoplasm being divided into multiple clusters
rather than a single cluster, as shown in figure 3.6.

Figure 3.7: Final Segmentation


We adopt a nucleus mask based merging technique for merging clusters. Here, we 
assume that every cluster from the cytoplasm mask which overlaps with a segment 
of the nucleus mask belongs to the cytoplasm and vice-versa. Using this technique, 
we merge all valid clusters and denote it as cytoplasm mask (see figure 3.7). 
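
The merging rule can be sketched as follows. The sketch interprets "overlaps with a
segment of the nucleus mask" as "contains a pixel that coincides with, or is
8-connected to, a nucleus pixel"; that interpretation, the function name and the
toy data are assumptions made for illustration.

```python
def merge_clusters(cluster_labels, nucleus_mask):
    """Return a binary cytoplasm mask made of the clusters next to the nucleus.

    `cluster_labels` holds a positive label per watershed cluster and 0
    elsewhere; `nucleus_mask` is a binary image of the same size.
    """
    rows, cols = len(cluster_labels), len(cluster_labels[0])
    keep = set()
    for r in range(rows):
        for c in range(cols):
            lab = cluster_labels[r][c]
            if lab == 0 or lab in keep:
                continue
            # Check the pixel itself and its 8 neighbours for nucleus pixels.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and nucleus_mask[nr][nc]):
                        keep.add(lab)
    return [[1 if cluster_labels[r][c] in keep else 0 for c in range(cols)]
            for r in range(rows)]


# Toy example: clusters 1 and 3 touch the nucleus, cluster 2 does not.
clusters = [[1, 1, 0, 2, 2],
            [1, 0, 0, 0, 2],
            [3, 3, 0, 0, 0]]
nucleus_mask = [[0, 0, 0, 0, 0],
                [0, 1, 0, 0, 0],
                [0, 0, 0, 0, 0]]
cytoplasm = merge_clusters(clusters, nucleus_mask)
```

Clusters that never touch the nucleus, such as neighbouring RBC’s, are dropped from
the resulting mask.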

3.3 Results 

The proposed scheme has been applied to images of peripheral blood smear slides
obtained by Media Lab Asia’s Biomedical group at the Indian Institute of Technology,
Kanpur. The typical image size is 768x576. Sample input images are shown in
figure 3.8. Segmentation outputs for the various types of WBC’s are shown in
Figures 3.9 (Lymphocyte), 3.10 (Monocyte), 3.11 (Eosinophil), 3.12 (Neutrophil) and
3.13 (Basophil). Our technique exhibits good results even in the case of varying
brightness and, in some cases, poor contrast between the cytoplasm and the image
background.

Although our technique produces good results for most of the cases, it is not able
to identify touching cells as different and considers them as a single cell. Further,
if there is a dense population of granulocytes, the technique cannot clearly
distinguish the nucleus, as can be seen in Figure 3.11.
Figure 3.8: Sample input image (original size 768x576)



Figure 3.9: Segmented output for typical lymphocyte 





Figure 3.10: Segmented output for typical monocyte 




Figure 3.11: Segmented output for typical eosinophil 
Figure 3.12: Segmented output for typical neutrophil




Figure 3.13: Segmented output for typical basophil 
Chapter 4


Feature Extraction 


4.1 Introduction 

Features are representative measures of a pattern and are chosen by their ability to
identify the input pattern. Feature extraction can be seen as a process of mapping
the given data into useful features.

The design of a feature extractor is highly application dependent. An ideal fea- 
ture extractor removes irrelevant and redundant information from data, preserving 
important discriminant information in order to ensure good class-separability. The 
advantages in working with features rather than the whole image are : 

• Reduction in the computational complexity of pattern classification due to the
reduction in dimensionality, resulting in efficient and faster classification.

• Reduction in space requirement, as feature data requires much less space
than the entire image.

4.2 Cell Structure 

WBC’s consist mainly of two parts, the nucleus and the cytoplasm. The
characteristics of these two parts vary across the different types of WBC’s, which
enables us to classify them as one of the five types. The quantitative features of
both cell-parts can be used to classify the cell.

White blood cells are broadly classified as Granulocytes and Non-Granulocytes, 
based on presence or absence of granules in the cytoplasm [7, 45]. Visually, the 
differences in the color, size and spread of granules serve as vital cues for distin- 
guishing the different types of granulocytes. Non-granulocytes have a single-lobed 
nucleus while the granulocytes generally have multi-lobed nucleus. The cell-parts 
are characterized in terms of their shape and color. 

4.2.1 Non-Granulocytes 

The non-granulocytes are of 2 types: lymphocytes and monocytes.



Figure 4.1: Typical Lymphocyte 


Lymphocyte : Lymphocyte is identified by the low value of the area-ratio between 
the cytoplasm and the nucleus, since the cytoplasm is present only along a thin 
rim around the nucleus. The shape of the nucleus is generally circular, as shown in 
figure 4.1. 

Monocyte : Monocyte generally shows a dent in the ellipsoidal nucleus. The 
shape and size of the dent are not consistent. Cytoplasm occupies fair share of total 
cell area unlike lymphocyte (Figure 4.2). 
Figure 4.2: Typical Monocyte

4.2.2 Granulocytes 

The granulocytes are divided in 3 classes : Eosinophils, Basophils and Neutrophils. 



Figure 4.3: Typical Eosinophil 

Eosinophils : Eosinophils have compactly packed, red-colored granules in the 
cytoplasm. The nucleus is generally bi-lobed, the lobes being linked by a ribbon-like 
extension. The shape of the cell boundary is generally oval (Figure 4.3). 

Neutrophils : Neutrophils have small purple-colored granules that are loosely 
scattered in the cytoplasm. They have segmented nuclei with 2-5 lobes. The shape 
of the cell boundary is generally oval (Figure 4.4). 

Figure 4.4: Typical Neutrophil


Figure 4.5: Typical Basophil 

Basophil : Basophils have large blue-colored granules loosely scattered in the 
cytoplasm. They have bi-lobed nuclei often obscured by granules. The shape of the 
cell boundary is generally oval (Figure 4.5). 

4.3 Cell Features 

Features for discriminating between the different cell classes are based on the
visual cues used by experts. However, converting such a description precisely into a
metric is a difficult task. Hence, we need to experiment with different possible
features and pick the best set of features for classification. We adopted the
following features:

• Shape based features 

• Color based features 

• Texture based features 

4.3.1 Shape based features


The shape of an object can be described using shape descriptors [18]. Shape
descriptors may not be an accurate description of shape, but they must be distinct
enough for different shapes for classification purposes. We use a binary mask of the
cytoplasm and the nucleus to compute these features. The features used are :


Eccentricity : Eccentricity is defined as the ratio between the major and the
minor axes. It gives a measure of how close the shape is to a circle. In our case,
the eccentricity can be computed from the ratio of the eigenvalues of the covariance
matrix of the position vectors of the foreground pixels. Let

$$p_i = \begin{bmatrix} x_i \\ y_i \end{bmatrix}$$

represent the position vectors. Then the covariance matrix is given by

$$C = E\left[(p_i - m_p)(p_i - m_p)^T\right]$$

where $m_p$ is the mean position vector. The eigenvalues are the values of
$\lambda$ that satisfy

$$C v = \lambda v, \quad \text{for non-zero } v$$

$$\text{Eccentricity} = 1 - \frac{\lambda_2}{\lambda_1}, \quad \text{where } \lambda_1 > \lambda_2 \qquad (1)$$
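As a minimal sketch of equation (1), assuming the binary mask is supplied as a list of rows of 0/1 values (the function name `eccentricity` and this input layout are illustrative choices, not part of the original system), the eigenvalues of the 2×2 covariance matrix can be obtained in closed form:

```python
import math

def eccentricity(mask):
    """Return 1 - lambda_2 / lambda_1 for the foreground pixels of a binary mask.

    `mask` is a list of rows of 0 (background) / 1 (foreground) values;
    this layout is an assumption made for the illustration.
    """
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(points)
    if n == 0:
        raise ValueError("mask has no foreground pixels")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the covariance matrix C = E[(p - m)(p - m)^T]
    cxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    cyy = sum((y - mean_y) ** 2 for _, y in points) / n
    cxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    half_trace = (cxx + cyy) / 2.0
    radius = math.sqrt(((cxx - cyy) / 2.0) ** 2 + cxy ** 2)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if lam1 == 0:
        return 0.0  # single pixel: no spread in any direction
    return 1.0 - lam2 / lam1

square = [[1] * 4 for _ in range(4)]  # equal spread in x and y
bar = [[1] * 8]                       # spread along x only
```

With this definition a filled square gives a value of 0 and a one-pixel-wide bar gives a value of 1, so rounder shapes score lower.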


Compactness : Compactness is the ratio of the area to the square of the perimeter.
Area is measured as the count of the foreground pixels for a cell. Perimeter is
calculated as the number of pixels lying on the boundary of a structure. Compactness
is an index of the extent of indentation of the boundary. The value is high if the
boundary is smooth, and low otherwise.

$$\text{Compactness} = \frac{\text{Area}}{\text{Perimeter}^2} \qquad (2)$$
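The compactness measure can be sketched as follows. The text does not say which connectivity defines a boundary pixel, so this example assumes that a boundary pixel is a foreground pixel with at least one 4-connected neighbour that is background or outside the image; the function names are hypothetical.

```python
def area_and_perimeter(mask):
    """Count foreground pixels and boundary pixels of a binary mask.

    Assumed convention: a boundary pixel is a foreground pixel with at least
    one 4-connected neighbour that is background or lies outside the image.
    """
    rows, cols = len(mask), len(mask[0])
    area = perimeter = 0
    for y in range(rows):
        for x in range(cols):
            if not mask[y][x]:
                continue
            area += 1
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < rows and 0 <= nx < cols) or not mask[ny][nx]:
                    perimeter += 1
                    break  # count each boundary pixel once
    return area, perimeter

def compactness(mask):
    """Area divided by the squared perimeter, both measured in pixels."""
    area, perimeter = area_and_perimeter(mask)
    if perimeter == 0:
        raise ValueError("mask has no foreground pixels")
    return area / perimeter ** 2

blob = [[1] * 5 for _ in range(5)]   # smooth 5x5 region
notched = [row[:] for row in blob]
notched[0][2] = 0                    # one-pixel indentation in the top edge
```

Under these assumptions the notched region has a lower compactness than the smooth one, which matches the interpretation of compactness as an index of boundary indentation.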


Area Ratio : Area ratio is the ratio of the number of pixels that make up the
cytoplasm to the number of pixels that make up the nucleus, i.e.

$$\text{AreaRatio} = \frac{\text{Pixel count of cytoplasm}}{\text{Pixel count of nucleus}} \qquad (3)$$

Number of lobes in the nucleus : Number of lobes that make up the nucleus is one 
of the distinguishing features between granulocytes and non-granulocytes. However, 
in the case of overlapping lobes, declustering needs to be done before counting the 
number of lobes.


4.3.2 Color based features 

The color features are obtained from the segmented nucleus and cytoplasm. The
average value of each color component, R, G and B, of the nucleus and the cytoplasm
is computed:

$$\text{Mean}_C = \frac{1}{N} \sum_{i=1}^{N} C_i \qquad (4)$$

where N is the total number of pixels in the region of interest and $C_i$ is the
corresponding color component (R, G or B) of pixel i.
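A minimal sketch of the mean color computation, assuming the image is a list of rows of (R, G, B) tuples and the region of interest is a 0/1 mask of the same shape (both layouts and the name `mean_color` are assumptions for illustration):

```python
def mean_color(image, mask):
    """Average R, G and B over the pixels selected by a binary mask.

    `image` is a list of rows of (R, G, B) tuples and `mask` a list of rows
    of 0/1 values with the same shape; both layouts are assumed here.
    """
    totals = [0, 0, 0]
    count = 0
    for image_row, mask_row in zip(image, mask):
        for pixel, selected in zip(image_row, mask_row):
            if selected:
                count += 1
                for channel in range(3):
                    totals[channel] += pixel[channel]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return tuple(total / count for total in totals)

image = [[(200, 10, 30), (100, 20, 30)],
         [(0, 0, 0), (60, 30, 90)]]
nucleus_mask = [[1, 1],
                [0, 1]]
nucleus_mean = mean_color(image, nucleus_mask)
```

Calling the same function once with the nucleus mask and once with the cytoplasm mask would yield the six color features.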


4.3.3 Texture based features 

Texture is defined as a function of the spatial variation in the pixel intensities.
In recent times, texture has become an important image feature, especially in
content-based image retrieval systems [19, 28]. In our work, we use texture features
computed from the gray-level co-occurrence matrix (GLCM), proposed by Haralick [20],
to quantify texture.

Gray-level Co-occurrence Matrix (GLCM) : The gray-level co-occurrence matrix is also
known as spatial gray level dependence (SGLD). Spatial gray-level co-occurrence
estimates image properties related to second-order statistics. As the name suggests,
the GLCM is constructed from the image by estimating the pairwise statistics of
the pixel intensity. Each element (i, j) of the matrix represents an estimate of the
probability that two pixels with a specified separation have gray levels i and j. The
separation is usually specified by a displacement $d$ and an angle $\theta$, i.e.

$$\text{GLCM}, \; \Phi(d, \theta) = [f(i, j \mid d, \theta)] \qquad (5)$$

$\Phi(d, \theta)$ will be a square matrix of side equal to the number of gray levels
in the image and will usually not be symmetric. Symmetry is often introduced by
effectively adding the GLCM to its transpose and dividing every element by 2. This
renders $\Phi(d, \theta)$ and $\Phi(d, \theta + 180^\circ)$ identical and makes the
GLCM unable to detect 180° rotations.
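The construction described above can be sketched as follows. In this illustration the separation is given as a pixel offset (dx, dy), which encodes both the distance and the angle, and the counts are normalised to sum to 1; the function name, the argument layout and the normalisation step are assumptions rather than details taken from the original implementation.

```python
def glcm(image, dx, dy, levels, symmetric=False):
    """Normalised gray-level co-occurrence matrix for the offset (dx, dy).

    `image` is a list of rows of integer gray levels in range(levels).
    Entry [i][j] estimates the probability that a pixel with level i has a
    neighbour with level j at the given offset.
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                counts[image[y][x]][image[ny][nx]] += 1
    if symmetric:
        # Add the matrix to its transpose and divide every element by 2
        counts = [[(counts[i][j] + counts[j][i]) / 2 for j in range(levels)]
                  for i in range(levels)]
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("offset leaves no pixel pairs inside the image")
    return [[c / total for c in row] for row in counts]

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
p = glcm(image, dx=1, dy=0, levels=4)                  # distance 1, angle 0
s = glcm(image, dx=1, dy=0, levels=4, symmetric=True)  # symmetrised version
```

In the example, the pair (0, 1) occurs twice from left to right but the pair (1, 0) never does, so `p` is not symmetric, whereas `s` assigns the same value to both entries.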

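In texture classification, instead of the individual elements of the matrix,
features derived from the matrix are used. Haralick et al. [20] proposed 14 features
from the matrix, of which only the following are widely used.

• Contrast = $\sum_{i,j} |i - j|^2 \, f(i,j)$

• Correlation = $\sum_{i,j} \frac{(i - \mu_i)(j - \mu_j) \, f(i,j)}{\sigma_i \sigma_j}$

• Energy = $\sum_{i,j} f(i,j)^2$

• Homogeneity = $\sum_{i,j} \frac{f(i,j)}{1 + |i - j|}$

where $f(i,j)$ represents the $(i,j)$-th entry of the GLCM, i.e. the number of
occurrences of the pair of gray levels i and j with distance $d$ and angle $\theta$
apart. $(\mu_i, \mu_j)$ and $(\sigma_i, \sigma_j)$ represent the mean and standard
deviation, respectively, of the GLCM along rows and columns and are obtained as :

$$\mu_i = \sum_i i \sum_j f(i,j) \qquad (6)$$

$$\mu_j = \sum_j j \sum_i f(i,j) \qquad (7)$$

$$\sigma_i = \sqrt{\sum_i (i - \mu_i)^2 \sum_j f(i,j)} \qquad (8)$$

$$\sigma_j = \sqrt{\sum_j (j - \mu_j)^2 \sum_i f(i,j)} \qquad (9)$$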
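As a sketch of how the four texture features could be computed from a normalised co-occurrence matrix: the function name is hypothetical, the matrix is assumed to sum to 1, and homogeneity uses the 1 / (1 + |i - j|) weighting, which is one common convention and may differ from the one used in the original experiments.

```python
import math

def glcm_features(p):
    """Contrast, correlation, energy and homogeneity of a normalised GLCM.

    `p` is a square list of lists whose entries sum to 1 (an assumption).
    Homogeneity uses the 1 / (1 + |i - j|) weighting, one common convention.
    """
    cells = [(i, j, p[i][j]) for i in range(len(p)) for j in range(len(p))]
    mu_i = sum(i * f for i, _, f in cells)
    mu_j = sum(j * f for _, j, f in cells)
    sigma_i = math.sqrt(sum((i - mu_i) ** 2 * f for i, _, f in cells))
    sigma_j = math.sqrt(sum((j - mu_j) ** 2 * f for _, j, f in cells))
    if sigma_i > 0 and sigma_j > 0:
        correlation = sum((i - mu_i) * (j - mu_j) * f
                          for i, j, f in cells) / (sigma_i * sigma_j)
    else:
        correlation = float("nan")  # undefined for a constant image
    return {
        "contrast": sum((i - j) ** 2 * f for i, j, f in cells),
        "correlation": correlation,
        "energy": sum(f ** 2 for _, _, f in cells),
        "homogeneity": sum(f / (1 + abs(i - j)) for i, j, f in cells),
    }

same_neighbours = glcm_features([[0.5, 0.0], [0.0, 0.5]])   # pairs (0,0), (1,1)
mixed_neighbours = glcm_features([[0.0, 0.5], [0.5, 0.0]])  # pairs (0,1), (1,0)
```

The two toy matrices show the extremes: when neighbouring pixels always share a gray level the contrast is 0 and the correlation is 1, and when they always differ the contrast rises and the correlation becomes -1.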


4.4 Conclusion 

For classification the following features were used : 

• Area ratio 

• Eccentricity of the nucleus 

• Eccentricity of the cytoplasm

• Compactness of the nucleus 

• Mean red value of the nucleus 

• Mean green value of the nucleus 

• Mean blue value of the nucleus 

• Mean red value of the cytoplasm 

• Mean green value of the cytoplasm 

• Mean blue value of the cytoplasm

• Contrast 

• Correlation 

• Energy 

• Homogeneity 

Discarded features : Although the number of lobes in the nucleus is an important
feature, in our case the lobes overlapped too much and were difficult to decluster,
so we dropped this feature. While compactness of the nucleus is considered as a
feature, compactness of the cytoplasm is discarded because, in most cases, the
cell boundary is nearly oval and hence compactness of the cytoplasm does not
serve as a distinguishing feature.

Chapter 5
Classification 


5.1 Introduction 

Classification is the task of assigning to an unknown test sample a label from
one of the known classes. The task of a classifier is to evaluate a given feature
vector and decide its label. A good feature extractor is an essential prerequisite
for a classifier: features with good discrimination can easily be labeled using
simple classifiers such as the nearest neighbour classifier, but if the patterns are
very close in feature space then non-linear methods like neural networks and support
vector machines are required. The accuracy of the classifier depends heavily on the
quality and the amount of information that the classifier is trained with. This can
be increased by providing higher-dimensional feature vectors, but increasing the
dimension of the feature vector may also increase redundant information, which is
undesirable. Hence, the feature vector is a trade-off between the permissible
classification error, the complexity of the classifier and the time required for
classification.

5.2 Data used for the study 

As a rule of thumb for supervised learning, the number of training patterns must
be 5 to 10 times the dimensionality of the feature vector. However, due to the
unavailability of sufficient data, we chose a smaller training set. The training data
consists of 50 samples, with 10 samples from each class, and the test data consists
of 30 samples (Table 5.1). The test instances used are different from the ones used
for training.


Class         Training   Testing
Basophil         10          5
Eosinophil       10          4
Lymphocyte       10         10
Monocyte         10          4
Neutrophil       10          7
Total            50         30

Table 5.1: Data set used for the experiments
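To make the gap between the rule of thumb and the available data concrete, the short calculation below assumes the full 14-dimensional feature vector listed at the end of Chapter 4 (4 shape, 6 color and 4 texture features); the variable names are illustrative.

```python
# Assumption: the combined feature vector has 14 dimensions
# (4 shape + 6 color + 4 texture features, as listed in Section 4.4).
n_features = 14
n_training = 50  # training samples reported in Table 5.1

recommended_low = 5 * n_features    # lower end of the 5x-10x rule of thumb
recommended_high = 10 * n_features  # upper end of the rule of thumb
shortfall = recommended_low - n_training
```

Under this assumption the rule of thumb calls for 70 to 140 training patterns, so the 50 available samples fall 20 short of even the lower bound.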


5.3 Supervised Learning 

The aim of supervised classification is to construct a model for predicting the
correct label for an unseen pattern, on the basis of the feature vector of the test
pattern. A good supervised learning algorithm leads to precise labeling of unseen
samples, based on the instances which are presented to the classifier along with
their labels a priori. Geometry based classifiers such as neural networks and support
vector machines (SVMs) rely on the estimation of the decision boundaries in the
actual (or a higher order) feature space, making use of appropriate error-minimizing
criteria. However, the quality of such a classifier relies heavily on the
time-consuming training process.

5.3.1 Neural Networks 

Neural networks are networks of non-linear computing elements, interconnected
through adjustable weights. The most popular learning technique for neural networks
is ‘feed forward back propagation learning’. Back propagation proceeds by
comparing the outputs of the network to the expected outputs, and computing an
error measure based on the sum of squared differences. Figure 5.1 shows a sample
neural network.


Figure 5.1: A sample neural network


5.3.2 Support Vector Machines 

Support vector machines [8] are based on the concept of a separating hyperplane.
SVMs achieve classification by finding a separating hyperplane (linear or nonlinear)
in a higher order mapped feature space of the data set. They are modeled as
optimization problems with a quadratic objective function and linear constraints.
Basically, SVMs try to maximize the margin between classes. The two classes are
optimally divided by a hyperplane, which does not depend on the probability
distribution. It is observed that the optimal hyperplane is determined only by a
small fraction of the data points, called “support vectors” (see figure 5.2). The
classifier training algorithm is a procedure to find these vectors.

5.4 Experiments and Results 

A simple feature set consisting of features based on shape, color and texture is
used for classification of WBCs. The individual performance of each category of
features is evaluated with the neural networks and the support vector machines.

Figure 5.2: Support vectors and separating hyperplane

We observed that, with feed forward back propagation neural networks, the best
results are obtained using two hidden layers with 6 neurons each. The number of
input neurons depends on the feature set used for the experiments, and the number
of output neurons was chosen to be 5. In the case of support vector machines, the
SVM with a polynomial kernel yielded the best results. The performance of the
combined features was also studied. Results are shown in table 5.2.
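The network layout described above (two hidden layers of 6 neurons and 5 output neurons) can be sketched as a plain forward pass. This is only an illustration of the architecture: the 10 inputs correspond to the shape-color feature set, while the sigmoid activation, the random untrained weights and the function names are assumptions and not details from the original experiments.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layers, x):
    """Propagate a feature vector through sigmoid layers (assumed activation)."""
    for weights, biases in layers:
        x = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
             for row, b in zip(weights, biases)]
    return x

rng = random.Random(0)
# 10 shape-color inputs, two hidden layers of 6 neurons, 5 output classes
sizes = [10, 6, 6, 5]
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

# Adjustable parameters: weights plus biases in every layer
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)

outputs = forward(layers, [0.0] * 10)          # untrained network, dummy input
predicted_class = outputs.index(max(outputs))  # index of the strongest output
```

Even this small layout has 143 adjustable parameters, which is worth keeping in mind given that only 50 training samples were available.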


Feature set    SVM    NNet
Texture        36.7   40.0
Shape-Color    70.0   80.0
Combined       70.0   76.7

Table 5.2: Comparison of classifier performance on the different feature sets (in %)
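Because the test set contains only 30 samples, each reported percentage corresponds to a whole number of correctly classified cells. The conversion below is a small worked calculation on the reported figures; the dictionary layout and names are illustrative.

```python
n_test = 30  # test samples reported in Table 5.1

accuracy_percent = {
    ("Texture", "SVM"): 36.7,
    ("Texture", "NNet"): 40.0,
    ("Shape-Color", "SVM"): 70.0,
    ("Shape-Color", "NNet"): 80.0,
    ("Combined", "SVM"): 70.0,
    ("Combined", "NNet"): 76.7,
}

# Number of correctly classified test samples implied by each percentage
correct = {key: round(value * n_test / 100)
           for key, value in accuracy_percent.items()}
```

Read this way, the difference between the neural network on shape-color features (24 of 30) and on the combined set (23 of 30) is a single test sample, so small differences in the table should be interpreted with caution.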


We see that shape and color based features perform better than texture features.
This can be explained as follows. In our case, we have a small rectangular window
containing cells, in which the texture of the different classes of cells cannot be
captured as discriminant features. Figure 5.3 shows the distribution of the 4 texture
features for the 50 training samples. We can observe that, except for the 3rd texture
feature (energy), the feature values vary little among the different classes, so
texture features alone provide little discriminant information to train the
classifier.

Figure 5.3: Texture features for all classes

Also, it can be seen that when texture features are combined with color and
shape features, classification accuracy does not improve; with the neural network
it declines. Including texture features probably introduces redundancy in the
feature set, making it difficult for a classifier to mark the decision boundary
between classes. Based on the results we obtained, we can say that, given shape and
color based features, texture features are redundant. The classification performance
is good considering the limited training data and poor image quality. The best
performance is obtained with the neural network classifier when trained with shape
and color based features.

Chapter 6

Conclusion and Future Work 


In this thesis, we propose an automated system to obtain the differential blood
count, using image processing and machine learning techniques. The system takes
color images of blood smears as input and determines the classes of the WBCs.

We present an effective two-stage segmentation technique. At first, k-means is
performed on the HSV-equivalent of the image in order to locate WBCs in the
image. A second level of segmentation is performed using autothresholding followed
by declustering using the watershed algorithm to achieve a finer level of
segmentation and to facilitate segmentation of the cytoplasm and the nucleus. The
features chosen for classification were based on the shape, color and texture of the
segmented nucleus and cytoplasm. Support vector machine and neural network
classifiers were tried on different combinations of feature sets; of these, color
and shape based features with the neural network classifier were observed to be the
most effective. The peak classification accuracy of 80% was obtained using a neural
network based classifier with color and shape based features.

The performance obtained is not on par with state-of-the-art work. Katz [24]
obtained a peak accuracy of 98% with a feature set based on cell color, size and
nuclear morphological information, with a dataset containing more than 200 instances
of different cells. Ongun et al. [30] achieved an accuracy of 91% using a support
vector machine with 57-dimensional feature vectors, which included color statistics,
texture, shape features and color histogram. Song et al. [42], using context-based
classification, obtained an accuracy of 91%, where the large training data consisted
of blood samples from 220 specimens (13,200 cells). Considering the paucity of data
and the relatively poor quality of the input images in our work, the proposed
technique seems promising.

6.1 Future Work 

• The segmentation scheme in our work is effective, but is unable to handle 
overlapping cells. The scheme can be enhanced by including techniques for 
declustering, leading to segmentation of overlapping cells as well. 

• A method identifying and rejecting unsatisfactorily segmented images needs 
to be devised. 

• Feature selection can be incorporated in order to eliminate the redundant and 
confusing features. 

• Our work focuses on the mature classes of cells only, but can be extended for 
the immature classes of cells also. 